Introduction to Gravitational Clustering

Armen Aghajanyan 


Abstract—The downfall of many supervised learning
algorithms, such as neural networks, is the inherent
need for a large amount of training data
(Benediktsson et al., 1993). Although there is a lot of
buzz about big data, there is still the problem of doing
classification from a small data-set. Other methods
such as support vector machines, although capable of
dealing with few samples, are inherently binary
classifiers (Cortes and Vapnik, 1995), and need
learning strategies such as One vs All in the case of
multi-class classification. In the presence of a large
number of classes this can become problematic. In this
paper we present a novel approach to supervised learning
through the method of clustering. Unlike traditional
methods such as K-Means (MacQueen, 1967), Gravitational
Clustering does not require the initial number of
clusters and builds the clusters automatically.
Individual samples can be arbitrarily weighted, and it
requires only a few samples while staying resilient to
over-fitting.


Keywords—Machine Learning, Classification, Clustering.


I. Introduction 

The name of this algorithm is derived from the
metaphor that the algorithm was built upon. Each
cluster is symbolic of a planet, and each planet has a
mass and a radius as well as the class that it represents.
But unlike real-life planets, our planets are static with
respect to other planets. The process of training can
be conceptually thought of as building a universe. The
process of predicting is simply placing a mass in the
universe and tracing which planet it lands on.

This algorithm exhibits three nice properties: 

1) Ability to learn from a few samples. 

2) Ability to weight the importance of training vectors. 

3) The nature of the algorithm makes it resilient to 
overfitting. 

The ability to weight the importance of training vectors,
together with the ability to learn from a few samples, allows
us to model a system that supports the notion of prototypes,
e.g. Eleanor Rosch (Lakoff, 1987, p. 41).

II. Definition 

Let us start by mathematically defining what each one
of our symbolic structures will be. The most important
structure is our cluster, or planet. We will define the
planet as containing a dynamic mass $m$, a dynamic radius $r$,
a dynamic position $\vec{x}$ and a static class $\theta$.
Mathematically:

$$
\begin{aligned}
m &\in \mathbb{R} \\
r &\in \mathbb{R} \\
\vec{x} &\in \mathbb{R}^n \\
\theta &\in \mathbb{Z} \\
P &= \{m, r, \vec{x}, \theta\}
\end{aligned}
\tag{1}
$$


Our universe will simply consist of a set of planets. The
universe will also hold a few global constants. The first is
the initial radius of a planet that has just been created,
which we will denote with $r'$. The second is the so-called
percent step, which represents the amount a test mass moves
before the forces on it are recalculated; we will denote this
with the Greek letter $\alpha$. The number of steps taken, or
iterations, will be denoted with $\beta$. The distance between
planets will be calculated with the function denoted $D$.
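
The structures defined in this section can be sketched in Python as follows. This is a minimal illustration rather than the author's implementation: the class and field names (`Planet`, `Universe`, `initial_radius`) are hypothetical, and the distance function is assumed to be Euclidean, which the text does not state.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Planet:
    """One cluster: a mass, a radius, a position and a class label."""
    mass: float              # m, grows as samples are absorbed
    radius: float            # r
    position: List[float]    # x, a point in R^n
    label: int               # theta, fixed when the planet is created


@dataclass
class Universe:
    """A set of planets plus the global constants r', alpha and beta."""
    initial_radius: float    # r'
    alpha: float             # percent step used during simulation
    beta: int                # number of simulation iterations
    planets: List[Planet] = field(default_factory=list)


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumption: the distance function is the Euclidean distance."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


universe = Universe(initial_radius=50.0, alpha=0.01, beta=100)
universe.planets.append(
    Planet(mass=1.0, radius=universe.initial_radius,
           position=[0.0, 0.0], label=0)
)
print(distance(universe.planets[0].position, [3.0, 4.0]))  # 5.0
```

Keeping the constants on the universe object mirrors the text, where $r'$, $\alpha$ and $\beta$ are global to the model rather than per planet.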


III. Training Model 

One of the better aspects of the model is its ability to
rate your feature vectors. To do so, let us define a hybrid
feature vector $h$.

$$h = \{\vec{x}, m, \theta\} \tag{2}$$

The $m$ variable allows us to rate the value of the feature
vector. For example, if you have a probabilistic diagnosis,
each feature vector will contain the class of the diagnosis as
well as the probability of the diagnosis, represented by the
mass. The training is quite simple. Below is the pseudo-code.

    nearplanets ← Find Planets in Radius of h_x;
    nearplanets ← nearplanets where p_θ = h_θ;
    if nearplanets is Empty then
        Universe Add Planet
            {m = h_m, r = r', x = h_x, θ = h_θ}
    else
        p ← planet that generates most force ∈ nearplanets;
        Universe update p →
            {m = p_m + h_m, r = m^(…), x = (p_m / m) p_x + (h_m / m) h_x}
    end

    Algorithm 1: Training Algorithm


The new position is the sum of the two position vectors,
each weighted by its share of the combined mass.
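
A minimal Python sketch of the training step is given below. It is an illustration under stated assumptions, not the author's code: the names `Planet`, `dist` and `train_one` are hypothetical, the distance is assumed Euclidean, "in radius" is read as the sample lying within a planet's radius, the force is taken as mass over squared distance, and the radius is left unchanged on a merge because the radius update cannot be read from the pseudo-code.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Planet:
    mass: float
    radius: float
    position: List[float]
    label: int


def dist(a, b):
    # Assumption: Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train_one(planets: List[Planet], x: List[float], mass: float,
              label: int, initial_radius: float) -> None:
    """Absorb one hybrid feature vector (position, mass, class)."""
    # Same-class planets whose radius contains the sample.
    near = [p for p in planets
            if p.label == label and dist(p.position, x) <= p.radius]
    if not near:
        planets.append(Planet(mass, initial_radius, list(x), label))
        return
    # Planet exerting the most force on the sample; the small epsilon
    # guards against a sample sitting exactly on a planet's centre.
    best = max(near,
               key=lambda p: p.mass / (dist(p.position, x) ** 2 + 1e-12))
    total = best.mass + mass
    # Mass-weighted average of the old centre and the new sample.
    best.position = [(best.mass * pi + mass * xi) / total
                     for pi, xi in zip(best.position, x)]
    best.mass = total
    # Assumption: radius kept as-is (the source's update rule is unknown).


planets: List[Planet] = []
train_one(planets, [0.0, 0.0], 1.0, 0, initial_radius=5.0)
train_one(planets, [2.0, 0.0], 3.0, 0, initial_radius=5.0)  # merges
train_one(planets, [1.0, 1.0], 1.0, 1, initial_radius=5.0)  # new class
print(len(planets), planets[0].position, planets[0].mass)
# 2 [1.5, 0.0] 4.0
```

In the usage above the second sample carries three times the mass of the first, so the merged centre lands three quarters of the way toward it.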
A. Asymptotic Analysis

Our training simply traverses all of the planets in the
universe and computes the distance from the training
sample. Let $N$ be the number of planets and $D$ the
dimensionality of our feature vectors. Assuming that
the planet exists, we get

$$O(D \cdot N) = O(N) \tag{3}$$

Using a KD-Tree (Bentley, 1975) will allow us to train with
the average asymptotic of

$$O(D \cdot \log N) = O(\log N) \tag{4}$$

On the flip side, assuming we have to add the planet:

$$O(D \cdot N + N_{near}) = O(N + N_{near}) \tag{5}$$

With a KD-Tree (Bentley, 1975):

$$O(D \cdot \log N + N_{near}) = O(\log N + N_{near}) \tag{6}$$

This is the asymptotic of adding a single training vector.
Letting $N_s$ be the number of samples, we end up with
the final equation:

$$O(N_s(\log N + N_n)) \tag{7}$$


B. Comparison of Training Times 



                     Gravitational Clustering   K-Means                SVM        Decision Trees
Big O                $O(N_s(\log N + N_n))$     $O(n^{dk+1} \log n)$   $O(n^3)$   $O(n_s n \log(n_s))$
Online Training      Yes                        Yes                    No         Partial
Variant Importance   Yes                        No                     No         No

• $N_n$ is synonymous with $N_{near}$

• $N_s$ is synonymous with $N_{samples}$


IV. Simulation Testing Model 


Metaphorically, predicting the class of a new point is
equivalent to dropping a piece of mass into the universe
and tracing the mass until it collides with a planet. In
this metaphor, we assume that the planets are infinitely
small and therefore there will be no interference. Our test
point will simply be defined as $l = \{\vec{x}\}$. Let us first
define the normalized directional force vector. Recall from
physics that the gravitational force between two planets is

$$F = G \frac{m_1 m_2}{r^2} \tag{8}$$


In our case, we will assume that the mass of each test
point is equal to every other, therefore we can disregard
the mass. We can also remove the $G$ constant. Our hybrid
force equation per planet $p$ is now:

$$F_p = \frac{p_m}{r^2} \tag{9}$$


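where $r$ is $D(p_x, l_x)$. We define the total normalized
force on our test mass with the custom equation

$$F_{net} = \sum_{p \in Universe} \frac{p_m (p_x - l_x)}{D(p_x, l_x)^2}, \qquad \hat{F}_{net} = \alpha \frac{F_{net}}{\lVert F_{net} \rVert} \tag{10}$$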
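
As a small worked example of the net force, take a test point at the origin and two hypothetical planets: one with mass 4 at $(2, 0)$ and one with mass 1 at $(0, -1)$. The sketch assumes that the distance is Euclidean, that each planet contributes its mass times the offset to the planet divided by the squared distance, and that normalizing means scaling the net force to length $\alpha$ (here $\alpha = 0.01$). None of these readings is fixed by the text, so the numbers are illustrative only.

```latex
\begin{aligned}
F_{net} &= \frac{4\,\bigl((2,0) - (0,0)\bigr)}{2^2}
         + \frac{1\,\bigl((0,-1) - (0,0)\bigr)}{1^2}
         = (2, 0) + (0, -1) = (2, -1) \\
\lVert F_{net} \rVert &= \sqrt{2^2 + (-1)^2} = \sqrt{5} \\
\alpha \frac{F_{net}}{\lVert F_{net} \rVert}
        &= 0.01 \cdot \frac{(2, -1)}{\sqrt{5}} \approx (0.0089, -0.0045)
\end{aligned}
```

The first planet contributes the larger pull, so the test mass moves mostly toward it, and the step length stays at $\alpha$ regardless of how strong the net force is.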


To restate, $\alpha$ is the percent step taken with respect to
the force. Now let us describe the simulation algorithm:


    pos ← l_x;
    for i in [0, β] step 1 do
        force ← Σ_{p ∈ Universe} p_m * (p_x − pos) / D(p_x, pos)^2;
        norm ← α * force / ‖force‖;
        pos ← pos + norm;
    end
    nearplanets ← Find Planets in Radius of pos;
    if nearplanets is not Empty then
        return mode[nearplanets_θ]
    else
        return [planet closest to pos]_θ
    end
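
The simulation can be sketched in Python as follows. This is a hedged illustration, not the author's implementation: `Planet`, `dist` and `predict_sim` are hypothetical names, the distance is assumed Euclidean, and each step is assumed to have fixed length α along the direction of the net force, which is one reading of the normalization step.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Planet:
    mass: float
    radius: float
    position: List[float]
    label: int


def dist(a, b):
    # Assumption: Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_sim(planets: List[Planet], x: List[float],
                alpha: float, beta: int) -> int:
    """Trace a test mass for beta steps, then read off a class."""
    pos = list(x)
    for _ in range(beta):
        force = [0.0] * len(pos)
        for p in planets:
            d2 = dist(p.position, pos) ** 2
            if d2 == 0.0:
                continue  # test mass sits exactly on this planet's centre
            for k in range(len(pos)):
                force[k] += p.mass * (p.position[k] - pos[k]) / d2
        mag = math.sqrt(sum(f * f for f in force))
        if mag == 0.0:
            break  # forces cancel, so the test mass stops moving
        # Assumption: fixed step of length alpha along the net force.
        pos = [c + alpha * f / mag for c, f in zip(pos, force)]
    # Planets whose radius contains the final position.
    near = [p for p in planets if dist(p.position, pos) <= p.radius]
    if near:
        return Counter(p.label for p in near).most_common(1)[0][0]
    return min(planets, key=lambda p: dist(p.position, pos)).label


planets = [Planet(5.0, 1.0, [0.0, 0.0], 0),
           Planet(1.0, 1.0, [10.0, 0.0], 1)]
print(predict_sim(planets, [4.0, 0.0], alpha=0.1, beta=50))  # 0
```

In the usage above the test mass starts between the two planets, is pulled toward the heavier one, and ends up oscillating within one step of its centre, so the class of that planet is returned.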


A. Asymptotic Analysis of Simulation Testing Model 

Let $N$ be the number of planets and $D$ the
dimensionality of our feature vector. Calculating the
force takes

$$O(4D \cdot N) = O(N) \tag{11}$$

The 4 comes from the vector arithmetic that needs to
be done: one subtraction, one multiplication, one distance
squared and one division. The $N$ term comes from the
summation. The total simulation then becomes

$$O(7D \cdot N \cdot \beta + N) \tag{12}$$

The 3 additional $D$ terms come from finding the magnitude,
multiplying by the force (simultaneously multiplying by
$\alpha$) and the update summation. The additional $N$ comes
from finding the planets whose radius contains pos. We can
disregard the final if statement since it does not directly
affect $N$. We get:

$$O(7D \cdot N \cdot \beta + N) = O(N(7D \cdot \beta + 1)) = O(N) \tag{13}$$


V. Probabilistic Non-Simulating Model 

We propose a different method of computing the class
of the test point, without the need for simulation and
through purely statistical methods. We first make the
assumption that a planet or cluster is normally distributed
around its center and that the standard deviation is some
function of the radius of the planet, $\sigma(p_r)$. Therefore
let us define the probability density function.

$$PDF_p = \frac{1}{\sqrt{2\pi}\,\sigma(p_r)} e^{-\frac{D(p_x, l_x)^2}{2\sigma(p_r)^2}} \tag{14}$$
Now to define our prediction equation:

$$\underset{\theta}{\mathrm{MAX}} \left[ \prod_{p \mid p_\theta = \theta_n}^{Universe} \frac{1}{\sqrt{2\pi}\,\sigma(p_r)} e^{-\frac{D(p_x, l_x)^2}{2\sigma(p_r)^2}} \right] \tag{15}$$


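To account for the fact that different classes have different
numbers of planets, we will transform this function into:

$$\underset{\theta}{\mathrm{MAX}} \left[ \frac{\log \prod_{p \mid p_\theta = \theta_n}^{Universe} p_m\, e^{-\frac{D(p_x, l_x)^2}{2\sigma(p_r)^2}}}{\left|\{p \mid p_\theta = \theta_n\}\right|} \right] \tag{16}$$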


We removed the normalization constant because this is a
relative measure. The denominator is the number of planets
per class, which ensures that there is no bias due to the
different numbers of clusters with varying radii. The mass
term is added to ensure that more massive planets have a
greater impact on the rating.


Through trial and error we found the best function
for $\sigma(p_r)$ was simply


The asymptotic will simply be

$$O(D \cdot N) = O(N) \tag{17}$$
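
The probabilistic prediction can be sketched in Python as below. This is an illustration under assumptions, not the author's implementation: the names `Planet`, `dist` and `predict_prob` are hypothetical, the distance is assumed Euclidean, the per-class score is read as the log of the product of mass-weighted Gaussian kernels divided by the number of planets in the class, and the default `sigma` (the radius itself) is only a placeholder for the function of the radius chosen by trial and error.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Planet:
    mass: float
    radius: float
    position: List[float]
    label: int


def dist(a, b):
    # Assumption: Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict_prob(planets: List[Planet], x: List[float],
                 sigma: Callable[[float], float] = lambda r: r) -> int:
    """Return the class with the highest per-planet average log score."""
    totals = defaultdict(float)  # per-class sum of log kernel values
    counts = defaultdict(int)    # per-class number of planets
    for p in planets:
        s = sigma(p.radius)
        d2 = dist(p.position, x) ** 2
        # log(m * exp(-d2 / (2 s^2))) = log(m) - d2 / (2 s^2);
        # summing logs replaces the product and avoids underflow.
        totals[p.label] += math.log(p.mass) - d2 / (2.0 * s * s)
        counts[p.label] += 1
    return max(totals, key=lambda c: totals[c] / counts[c])


planets = [Planet(2.0, 1.0, [0.0, 0.0], 0),
           Planet(2.0, 1.0, [4.0, 0.0], 1)]
print(predict_prob(planets, [1.0, 0.0]))   # 0
print(predict_prob(planets, [3.5, 0.0]))   # 1
```

Working in log space is a design choice of this sketch: it is mathematically the same as taking the log of the product, but stays numerically stable when a class has many planets far from the test point.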


VI. Testing Results 

We tested the algorithm on the Wisconsin breast cancer
data-set (Wolberg and Mangasarian, 1990) (Lichman, 2013).
Below are the results.


Gravitational Clustering   $r' = 50$, $\alpha = 0.01$, $\beta = 100$   $r' = 5000$, $\alpha = 0.001$, $\beta = 1000$
Simulated Model            89.65%                                      90.59%
Probabilistic Model        92.78%                                      72.41%


It is interesting to note that the larger and fewer the
clusters, the less accurate the probabilistic model will be,
unless of course the clusters perfectly model the data that
they encapsulate.

We continued our testing by comparing against the outputs
of some popular out-of-the-box methods. All the other
algorithms were implemented in the scikit-learn library
(Pedregosa et al., 2011). The data-sets we used were the
popular Iris data-set (Lichman, 2013), the digits data-set
(Pedregosa et al., 2011) and the Olivetti data-set
(Bevilacqua et al., 2006).




                                   Algorithm
Data-sets   GC Prob   GC Sim   SVM (poly)   SVM (rbf)   Naive Bayes (Gaussian)
Iris        98.41%    96.82%   94.66%       97.33%      96%
Digits      86.95%    91.04%   98.99%       25.61%      83.85%
Olivetti    65.5%     77.5%    7.5%         8.5%        99.5%

Accuracy Per Data-set


To show that our algorithm can handle very few samples,
we tested the same data-sets again, but this time we
used only 1 sample per class as the training data.
Below are the results.

             Algorithm Type
Data-sets   GC Prob   GC Sim
Iris        93.33%    92.00%
Digits      59.96%    58.18%
Olivetti    63.5%     53.75%


VII. Conclusion

In this paper we introduced a novel technique for clustering
and supervised learning that can learn from a few
samples, while maintaining a low asymptotic run-time
and inherently allowing for arbitrary sample weighting.
We compared it to current techniques for classification
and showed both the strengths of the algorithm and its
weaknesses. From the test results we can infer
that our algorithm acts consistently on both low- and
high-dimensional data, and stays consistent across
a range of multi-class data-sets. All the code written,
including the tests and the algorithm itself, can be found at
https://github.com/ArmenAg/GravitationalClustering/

Thank you for reading.